Rainbow Tours S.A. is a Polish travel agency that has been operating since 1990. The company offers many types of trips, including all-inclusive vacations and guided tours, across a wide range of destinations. Most trips are organized within Europe or to popular destinations outside Europe, for instance Turkey. The offers available on the agency's website vary in popularity among customers; the interest in purchasing a specific travel package is reflected in the number of ratings given to the offer by users.
The main objective of this project is to train several ML models able to predict whether a specific trip is popular among the agency's clients. Furthermore, the project aims to identify the factors that make a trip appealing to a wider range of customers and, as a result, more popular overall.
To achieve these goals, the first step involved gathering data. The final dataset was scraped (extracted from the HTML code) from the Rainbow Tours website and consists of more than 2,000 travel offers organized in 2023. Next, the dataset was appropriately prepared, and the number of reviews on the website was chosen as the criterion of each trip's popularity. Because popular trips constitute a minority of the agency's offer, different methods of dealing with imbalanced data were applied.
In order to better understand and interpret how the proposed models make their classifications, several methods of explaining and examining predictive machine learning models were used, including Partial Dependence Profiles, Ceteris Paribus Profiles, Break-Down plots, and Shapley Additive Explanations.
The first step of the project involves gathering the data from the Rainbow Tours website. This was done via web scraping using the selenium and beautifulsoup4 packages. The code used for that purpose is presented below.
# web-scraping packages
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import requests
import time
import re
import math
import warnings
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score, roc_curve
import dalex as dx
def notebook_setup():
    # set pandas to display 3 decimal places
    pd.set_option('display.float_format', lambda x: '%.3f' % x)
    # disable warnings
    warnings.filterwarnings('ignore')
    # plot formatting
    sns.set_style("whitegrid")
    sns.set(font_scale=1.1)
notebook_setup()
seed = 123
# open web browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.session_id
# open the Rainbow Tours website
PAGE_URL = 'https://r.pl/szukaj?sezon=lato-2023'
driver.get(PAGE_URL)
driver.maximize_window()
# accept cookies
btn = driver.find_element(By.XPATH, "//*[contains(text(), 'Akceptuj wszystkie')]")
btn.click()
Since the number of offers displayed on the Rainbow website is limited to 10, the code below was used to automatically expand the list of offers by clicking the "Show More" button. For this purpose, the Selenium library was employed.
The following code opens a browser window and expands the list of offers by clicking the "Show More" button a defined number of times. After being clicked, the "Show More" button on the Rainbow website disappears from view and only reappears after the page has loaded additional offers three times; the website starts loading further offers automatically once the user scrolls to the bottom of the list. The code accounts for this behavior: after clicking the "Show More" button, it scrolls the page up and down to trigger the loading of additional offers, then waits a specified number of seconds for the button to reappear. After the list of offers was fully expanded, the HTML code of the page was extracted.
As a result of web scraping, the HTML code of a page containing 2098 offers was obtained, from which essential information about the trips was extracted. The extracted information is stored in a data frame and saved to a .csv file.
# expand the list of offers MAX_I times
SHOW_MORE_CLASS_NAME = 'szukaj-bloczki__load-more-button'
MAX_I, MAX_J = 100, 5
SLEEP_TIME = 5
for i in range(MAX_I):
    try:
        # click on "Show more" button
        expand_bar = driver.find_element(By.CLASS_NAME, SHOW_MORE_CLASS_NAME)
        expand_bar.click()
        for j in range(MAX_J):
            # scroll page up and down to trigger the offers loading
            actions = ActionChains(driver)
            actions.send_keys(Keys.PAGE_UP).perform()
            actions.send_keys(Keys.PAGE_DOWN).perform()
            # wait SLEEP_TIME seconds for website to load
            time.sleep(SLEEP_TIME)
        print(f"List expanded {i+1} times")
    except Exception as ex:
        print(f"Exception during expanding list: {ex}")
        break
# retrieve html content
html_content = driver.page_source
# save html to text file
# with open('page_html_content.txt', 'w', encoding="utf-8") as file:
# file.write(html_content)
# close the browser window
driver.close()
# parse html content with bs4
soup = BeautifulSoup(html_content, 'html.parser')
# find all offer blocks
OFFER_BLOCK_CLASS_NAME = 'n-bloczek szukaj-bloczki__element'
offer_blocks = soup.find_all("a", {"class": OFFER_BLOCK_CLASS_NAME})
# class names of proper html tags
TITLE = re.compile('.*r-bloczek-naglowek__tytul.*')
LOCATION = re.compile('.*r-bloczek-lokalizacja.*')
RATING = re.compile('.*r-bloczek-ocena r-bloczek__ocena.*')
PRICE = re.compile('.*r-bloczek-cena__aktualna.*')
STARS = 'r-gwiazdki r-gwiazdki--medium'
DATE = 'r-bloczek-wlasciwosc__dni'
AIRPORT = 'r-bloczek-wlasciwosc'
FOOD = 'r-bloczek-wlasciwosc'
OLD_PRICE = 'r-bloczek-cena__przed-obnizka'
def get_price(offer_block_html):
    offer = offer_block_html
    offer_price = offer.find("span", {"class": PRICE}).get_text().strip()
    price = float(re.search(r'\d+ \d+', offer_price).group().replace(' ', ''))
    # old price (present only for discounted offers)
    price_before = offer.find("span", {"class": OLD_PRICE})
    if price_before is not None:
        price_before = price_before.get_text().strip()
        price_before = float(re.search(r'\d+ \d+', price_before).group().replace(' ', ''))
    else:
        price_before = np.nan
    return price, price_before
def get_food(offer_block_html):
    offer = offer_block_html
    # food info shares its class name with the airport block; it is the third occurrence
    offer_food = offer.find_all("div", {"class": AIRPORT})
    if len(offer_food) > 2:
        offer_food = offer_food[2].find_all('span')[0].get_text().strip()
    else:
        offer_food = np.nan
    return offer_food
def get_airport(offer_block_html):
    offer = offer_block_html
    airport = offer.find_all("div", {"class": AIRPORT})[1].find_all('span')[0].get_text().strip()
    return airport
def get_time(offer_block_html):
    offer = offer_block_html
    # '05.05.2023 (8 dni)' format
    offer_date = offer.find("div", {"class": DATE}).find_all('span')[0].get_text().strip()
    duration = int(re.search(r'\d+ dni', offer_date).group().replace(' dni', ''))
    date = re.search(r'[\d\.]+ \(', offer_date).group().replace(' (', '')
    return date, duration
def get_rating(offer_block_html):
    offer = offer_block_html
    offer_rating = offer.find("span", {"class": RATING})
    if offer_rating is not None:
        offer_rating = offer_rating.get_text().strip()
        offer_rating = offer_rating.replace('\xa0', '')
        ratings_no = int(re.search(r'\(\d+\.*', offer_rating).group().replace('(', ''))
        rating = float(re.search(r'\d+\.\d+', offer_rating).group().replace(' ', ''))
    else:
        rating = np.nan
        ratings_no = 0
    return rating, ratings_no
def get_stars(offer_block_html):
    offer = offer_block_html
    offer_stars = offer.find("div", {"class": STARS})
    if offer_stars is not None:
        offer_stars = float(offer_stars['data-rating'])
    else:
        offer_stars = np.nan
    return offer_stars
def get_loc(offer_block_html):
    offer = offer_block_html
    loc = offer.find("span", {"class": LOCATION}).get_text().strip()
    loc = loc.replace('\u205f•\u205f', ' ')
    loc = loc.replace('\u205f•', ' ')
    # split location into country and city
    loc_component = loc.split(":")
    if len(loc_component) > 1:
        country, city = loc_component[:2]
    else:
        country = loc_component[0]
        city = np.nan
    # remove words not related to location, such as "Wypoczynek" (vacation) or "Objazd" (tour)
    country = country.replace("Wypoczynek ", "")
    country = country.replace("Objazd + wypoczynek ", "")
    country = country.replace("Objazd ", "")
    return country, city
def get_offer_df(offer_block_html):
    """
    Retrieve information about a single offer from its html block and save it to a pandas data frame.
    """
    offer = offer_block_html
    # offer title
    title = offer.find("span", {"class": TITLE}).get_text().strip()
    # raw location string
    location = offer.find("span", {"class": LOCATION}).get_text().strip()
    # trip destination
    country, city = get_loc(offer)
    # stars
    offer_stars = get_stars(offer)
    # rating
    rating, ratings_no = get_rating(offer)
    # date
    date, duration = get_time(offer)
    # airport
    airport = get_airport(offer)
    # food
    food = get_food(offer)
    # price
    price, price_before = get_price(offer)
    # create dict with info about a trip
    trips_dict = {
        "title": [title],
        "location": [location],
        "country": [country],
        "city": [city],
        "stars": [offer_stars],
        "rating": [rating],
        "ratings_no": [ratings_no],
        "date": [date],
        "duration": [duration],
        "airport": [airport],
        "food": [food],
        "price": [price],
        "old_price": [price_before]
    }
    return pd.DataFrame(trips_dict)
# create data frame where each row corresponds with info about a single offer
trips_df = [get_offer_df(offer) for offer in offer_blocks]
trips_df = pd.concat(trips_df, ignore_index=True)
# save data to .csv
trips_df.to_csv("rainbow_summer_2023.csv", index=False)
In the second part of the project, a descriptive analysis was conducted on the data related to summer 2023 tours organized by the Rainbow Tours travel agency. As part of the analysis, descriptive statistics of the features were interpreted and visualized on graphs.
The analyzed dataset consists of 2098 records and 13 columns. Each row describes a single tour available for purchase on the Rainbow Tours website. All trips are scheduled between May and December 2023.
The columns present in the described dataset contain the following information about the tours:
- title: the name of the hotel or tour,
- location: the raw location string from the offer,
- country, city: the trip destination,
- stars: the standard of the hotel assigned by the agency,
- rating, ratings_no: the average user rating and the number of ratings,
- date, duration: the departure date and the length of the trip in days,
- airport: the departure airport,
- food: the meal option included,
- price, old_price: the current price and the price before discount (if any).
Some variables, such as "City," "Stars," "Rating," and "Old price," have missing values, which is to be expected for a dataset obtained via web scraping. Tours with missing values in the "City" column usually have multiple destinations. The "Stars" and "Rating" columns may be empty because some offers have never been rated by the agency or by any user on the website. The absence of a value in the "Old price" column indicates that the tour still has its base price and has never been discounted.
raw_data = pd.read_csv("data.csv")
raw_data[['country', 'city', 'airport', 'food']] = raw_data[['country', 'city', 'airport', 'food']].apply(lambda x: x.str.strip())
raw_data.head()
| title | location | country | city | stars | rating | ratings_no | date | duration | airport | food | price | old_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Paras Sun | Wypoczynek Grecja: Rodos | Grecja | Rodos | 2.000 | 4.500 | 188 | 12.05.2023 | 8 | Warszawa Chopin | Śniadania | 1500.000 | NaN |
| 1 | Anaxos Bay | Wypoczynek Grecja: Lesbos | Grecja | Lesbos | 2.000 | 5.000 | 11 | 30.08.2023 | 8 | Katowice | Bez wyżywienia | 1558.000 | NaN |
| 2 | Elena Studios | Wypoczynek Grecja: Lesbos | Grecja | Lesbos | 2.000 | 4.900 | 25 | 30.08.2023 | 8 | Katowice | Bez wyżywienia | 1635.000 | NaN |
| 3 | Studia Maria | Wypoczynek Grecja: Kokkino Nero | Grecja | Kokkino Nero | 2.500 | 5.200 | 71 | 31.05.2023 | 8 | Wrocław | Bez wyżywienia | 1766.000 | NaN |
| 4 | Apartamenty Omiś | Wypoczynek Chorwacja: Kastelanska Riwiera - Ma... | Chorwacja | Kastelanska Riwiera - Makarska Riwiera | 2.500 | 3.700 | 12 | 22.06.2023 | 8 | Katowice | Bez wyżywienia | 1779.000 | NaN |
raw_data.shape
(2098, 13)
The printout below summarises the number of missing values and the datatype of each variable.
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2098 entries, 0 to 2097
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   title       2098 non-null   object
 1   location    2098 non-null   object
 2   country     2097 non-null   object
 3   city        1869 non-null   object
 4   stars       1621 non-null   float64
 5   rating      1511 non-null   float64
 6   ratings_no  2098 non-null   int64
 7   date        2098 non-null   object
 8   duration    2098 non-null   int64
 9   airport     2098 non-null   object
 10  food        2083 non-null   object
 11  price       2098 non-null   float64
 12  old_price   787 non-null    float64
dtypes: float64(4), int64(2), object(7)
memory usage: 213.2+ KB
One of the tours in the dataset has missing values in the "Country" and "City" fields. Their values were inferred by visiting the offer's page on the website and imputed accordingly.
raw_data.loc[693]
title         Peridis Family Resort
location      Wypoczynek
country       NaN
city          NaN
stars         5.000
rating        NaN
ratings_no    0
date          26.09.2023
duration      8
airport       Wrocław
food          Bez wyżywienia
price         3316.000
old_price     NaN
Name: 693, dtype: object
# update trip with missing data about country and city
# data were imputed based on the resort name
raw_data.loc[693, ["country", "city"]] = ["Grecja", "Kos"]
The table below presents descriptive statistics of the numeric variables in the dataset. The values indicate that the average number of stars assigned by the agency is approximately 4, and no trip has fewer than 2 stars. Customer ratings vary from 2.8 to 6, with a mean of 5.2 and a standard deviation of 0.5. The average trip has been rated about 42 times. Furthermore, the majority of tours are planned for 8 days. Trip prices follow a right-skewed distribution with a mean of about 5,400 PLN, although some offers exceed 10,000 PLN. 787 out of 2098 tour packages are discounted on the website; for these, the mean base price (before discount) is approximately 7,500 PLN.
raw_data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| stars | 1621.000 | 4.194 | 0.752 | 2.000 | 4.000 | 4.000 | 5.000 | 5.000 |
| rating | 1511.000 | 5.204 | 0.513 | 2.800 | 4.900 | 5.300 | 5.500 | 6.000 |
| ratings_no | 2098.000 | 42.864 | 94.447 | 0.000 | 0.000 | 7.000 | 41.000 | 1153.000 |
| duration | 2098.000 | 8.137 | 1.984 | 1.000 | 8.000 | 8.000 | 8.000 | 22.000 |
| price | 2098.000 | 5403.184 | 3341.820 | 1500.000 | 3030.000 | 4174.000 | 7178.000 | 31845.000 |
| old_price | 787.000 | 7511.075 | 3942.276 | 2447.000 | 4298.000 | 7383.000 | 9109.000 | 32345.000 |
The printout below presents the number of unique values in each column.
raw_data.nunique()
title         2018
location       440
country        155
city           263
stars            7
rating          31
ratings_no     253
date           199
duration        20
airport         12
food            16
price         1702
old_price      734
dtype: int64
As mentioned earlier, the aim of this project is to build a classifier capable of identifying tours that are most popular among all Rainbow Tour offerings. The popularity of a trip is determined based on the number of reviews it has received from the agency's customers. If a tour has been reviewed more than 30 times, it is labeled as popular.
Applying this criterion, 621 offers are labeled as popular. Since popular tours constitute a significantly smaller share of the dataset, the classes are imbalanced, which may pose challenges during model training. Therefore, in the subsequent steps of this project, techniques to mitigate the impact of class imbalance have been applied.
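One common family of such techniques is resampling. As a minimal, self-contained illustration (the project imports SMOTE from imblearn for this purpose; the sketch below instead shows plain random oversampling with scikit-learn's `resample`, on placeholder data):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(123)
# placeholder imbalanced problem: roughly 30% positive class, as in the trips data
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.3).astype(int)

# upsample the minority class (with replacement) to the majority class size
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=123)
balanced_idx = np.concatenate([majority, minority_upsampled])
X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))  # both classes now equal in size
```

SMOTE differs from this sketch in that it synthesizes new minority-class observations by interpolating between neighbours rather than duplicating existing rows; either way, resampling is applied only to the training set.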
raw_data["is_popular"] = np.where(raw_data["ratings_no"] > 30, 1, 0)
target = "is_popular"
# make column names consistent
old_cols = raw_data.columns
# create another dataframe for plotting the data
plot_data = raw_data.copy()
# prettify column names to make them look better on plots
def prettify_text(text):
    text = text.replace(".", "")
    text = text.replace("_", " ")
    text = " ".join(re.sub(r"([A-Z])", r" \1", text).split())
    text = text[0].upper() + text[1:].lower()
    return text
binary_map = {0: "No", 1: "Yes"}
plot_data[target] = plot_data[target].map(binary_map)
plot_data.columns = [prettify_text(col) for col in old_cols]
plt.figure(figsize=(7, 6))
ax = sns.countplot(data=plot_data,
x="Is popular",
order=plot_data["Is popular"].value_counts().index)
ax.set(ylabel="Count")
plt.show()
The analyzed offers have various destinations primarily located in European countries. The most popular destinations for vacations organized by Rainbow are Greece (over 250 trips), Spain (approximately 200 trips), and Turkey (also around 200 trips). Among other frequently visited countries are Egypt, Italy, Bulgaria, and Tunisia. Greece and Egypt exhibit the highest percentage of popular tours among all available offers.
Concerning the most frequently visited cities, the dominant ones include the Turkish Riviera, Varadero, and Hurghada. Sicily, Sunny Beach, Tenerife, and the Egyptian Riviera are also popular. The highest percentage of popular tours among all available offers to these cities characterizes Varadero and Hurghada, as well as the Egyptian Riviera.
def words_counter(data, col):
    words_cnt = dict(Counter(data[col]))
    words_cnt_list = [(key, val) for key, val in words_cnt.items()]
    words_cntr_df = pd.DataFrame(words_cnt_list,
                                 columns=[col.title(), "Count"]).sort_values(["Count"], ascending=False)
    return words_cntr_df
country_cnt = words_counter(raw_data, 'country')
city_cnt = words_counter(raw_data, 'city')
city_cnt = city_cnt.dropna()
popular_countires = country_cnt.where(country_cnt['Count'] > 85)['Country'].dropna().unique()
popular_cities = city_cnt.where(city_cnt['Count'] > 39)['City'].dropna().unique()
fig, axs = plt.subplots(2, 2, figsize=(20, 10))
ax = sns.barplot(x='Country', y='Count', data=country_cnt.head(7), ax=axs[0, 0])
axs[0, 0].set(ylabel="Count")
ax = sns.barplot(x='City', y='Count', data=city_cnt.head(7), ax=axs[0, 1])
axs[0, 1].set(ylabel="Count")
sns.countplot(data=plot_data[plot_data['Country'].isin(popular_countires)],
x='Country', hue='Is popular', ax=axs[1, 0], dodge=False, hue_order = ['No', 'Yes'],
order=plot_data[plot_data['Country'].isin(popular_countires)]["Country"].value_counts().index)
axs[1, 0].set(xlabel="Country", ylabel="Count")
sns.countplot(data=plot_data[plot_data['City'].isin(popular_cities)],
x='City', hue='Is popular', dodge=False, ax=axs[1, 1], hue_order = ['No', 'Yes'],
order=plot_data[plot_data['City'].isin(popular_cities)]["City"].value_counts().index)
axs[1, 1].set(xlabel="City", ylabel="Count")
plt.show()
It has been observed that some tours fall into the category of guided tours, meaning that during the vacation the agency's customers visit more than one country. It was decided to label these tours by creating a new column called "Multiple destination." Note that only a very small number of offers are organized to more than one country.
for country in raw_data["country"].unique()[-5:]:
    print("-", country)
- Peru, Brazylia, Argentyna, Chile, Boliwia
- Australia
- Australia i Nowa Zelandia
- Singapur, Australia, Nowa Zelandia, Fidżi
- Australia, Zjednoczone Emiraty Arabskie, Nowa Zelandia
# create column indicating if trip has multiple destinations
raw_data['multiple_destination'] = np.where(raw_data["country"].str.contains(", "), 1,
np.where(raw_data["country"].str.contains(" i "), 1, 0))
plot_data['Multiple destination'] = raw_data['multiple_destination'].copy()
plot_data['Multiple destination'] = plot_data['Multiple destination'].map(binary_map)
plt.figure(figsize=(10, 5))
ax = sns.countplot(x='Multiple destination', hue='Is popular', dodge=1,
data=plot_data, hue_order = ['No', 'Yes'])
ax.set(ylabel="Count")
plt.show()
From the distribution of number of stars, it is evident that stars were awarded by either the agency expert or a simple algorithm, as integer values predominate (primarily 4 or 5 stars). The tours rated with 4 or 4.5 stars have the greatest popularity.
On the other hand, the density plot of user ratings split by tour popularity indicates that higher-rated tours, with average ratings around 5 to 5.5, tend to be more popular, though not dramatically so. Tours rated below 5 or above 5.5 are often less popular. This could be attributed to the fact that frequently booked tours also receive a substantial number of user ratings, and users assess tours subjectively, rating them higher or lower at different times; as a result, the ratings tend to average out at a similar level. A less popular tour, by contrast, often has only a handful of ratings, so a few similar scores can produce a very high or very low average. Such averages are less representative.
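This intuition about small samples can be illustrated with a quick simulation, independent of the scraped data: the spread of the average rating shrinks as the number of ratings grows. The mean 5.2 and standard deviation 0.5 below mirror the dataset-wide rating statistics; everything else is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(123)
TRUE_MEAN, SCALE = 5.2, 0.5  # mirrors the dataset-wide rating statistics

def spread_of_average(n_ratings, n_tours=5000):
    # simulate n_tours tours, each rated n_ratings times,
    # and return the standard deviation of the per-tour average ratings
    ratings = rng.normal(TRUE_MEAN, SCALE, size=(n_tours, n_ratings))
    return ratings.mean(axis=1).std()

few, many = spread_of_average(3), spread_of_average(100)
print(f"std of the average: {few:.3f} with 3 ratings, {many:.3f} with 100")
```

With only a few ratings the per-tour average spreads roughly as 0.5/√n, so tours with few reviews are far more likely to land in the extreme tails of the rating distribution.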
fig, axs = plt.subplots(1, 2, figsize=(20, 5), gridspec_kw=dict(width_ratios=[1, 2]))
sns.countplot(data=plot_data, x='Stars', hue='Is popular',
dodge=0, ax=axs[0], hue_order = ['No', 'Yes'])
axs[0].set(xlabel="Stars", ylabel="Count")
axs[0].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
sns.histplot(plot_data.loc[plot_data['Is popular']=="No", "Rating"], ax=axs[1], kde=True)
sns.histplot(plot_data.loc[plot_data['Is popular']=="Yes", "Rating"], ax=axs[1], kde=True, color="darkorange")
axs[1].set(xlabel="Rating")
plt.show()
Analyzing the chart of the months in which tours are organized reveals the surprising fact that the majority of them are scheduled in December. This may be attributed, for example, to the fact that people go on vacation during the Christmas holidays. Many offers are available for May and June, as well as September and October. The most popular departures occur in the spring months of May and June. The most frequent duration of tours is 8-9 days, while a minimal portion lasts for 4-5 days. Other tour lengths occur very rarely.
raw_data['month'] = pd.to_datetime(raw_data['date'], format="%d.%m.%Y").dt.strftime('%B')
plot_data['Month'] = raw_data['month'].copy()
mnth_order = ['May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
fig, axs = plt.subplots(1, 2, figsize=(20, 5), gridspec_kw=dict(width_ratios=[2, 3]))
sns.countplot(x='Month', hue='Is popular', dodge=0, data=plot_data,
order=mnth_order, ax=axs[0], hue_order = ['No', 'Yes'])
axs[0].set(ylabel="Count", xlabel='Month')
axs[0].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
axs[0].set_xticklabels(mnth_order, rotation=25)
sns.countplot(x='Duration', hue='Is popular', dodge=0, data=plot_data,
ax=axs[1], hue_order = ['No', 'Yes'])
axs[1].set(ylabel="Count", xlabel='Duration')
axs[1].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
plt.show()
The majority of Rainbow Tours depart from Warsaw Chopin Airport or Katowice. However, the departure location does not seem to impact the popularity of the tour.
Regarding meal options, the most commonly available choices are breakfast, breakfast and dinner, or "ultra all-inclusive." Among these three, tours offering breakfast and dinner are the most popular.
raw_data['airport'] = raw_data['airport'].replace('/', '', regex=True)
plot_data['Airport'] = raw_data['airport'].copy()
air_cnt = words_counter(raw_data, 'airport')
popular_airport = air_cnt.where(air_cnt['Count'] > 20)['Airport']
food_cnt = words_counter(raw_data, 'food')
popular_food = food_cnt.where(food_cnt['Count'] > 20)['Food']
fig, axs = plt.subplots(1, 2, figsize=(22, 5))
ax = sns.countplot(x='Airport', hue='Is popular', dodge=1, ax=axs[0],
data=plot_data[plot_data['Airport'].isin(popular_airport)], hue_order = ['No', 'Yes'])
ax.set(ylabel="Count", xlabel='Airport')
ax = sns.countplot(x='Food', hue='Is popular', dodge=1, ax=axs[1],
data=plot_data[plot_data['Food'].isin(popular_food)], hue_order = ['No', 'Yes'])
ax.set(ylabel="Count", xlabel='Food type')
plt.xticks(rotation=25)
plt.show()
The following charts indicate that tours with lower prices are more popular, both in terms of current prices and considering prices before promotions. This is not surprising, as more affordable trips are financially accessible to a larger number of people.
fig, axs = plt.subplots(2, 2, figsize=(20, 10), gridspec_kw=dict(width_ratios=[1, 2]))
sns.barplot(data=plot_data, y='Price', x='Is popular', ax=axs[0,0])
axs[0,0].set(ylabel="Mean price")
sns.kdeplot(plot_data[plot_data['Is popular']=="No"]['Price'], ax=axs[0,1], legend=False)
sns.kdeplot(plot_data[plot_data['Is popular']=="Yes"]['Price'], ax=axs[0,1])
axs[0,1].set(xlabel="Price")
axs[0,1].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
sns.barplot(data=plot_data, y='Old price', x='Is popular', ax=axs[1,0])
axs[1,0].set(ylabel="Mean old price")
sns.kdeplot(plot_data[plot_data['Is popular']=="No"]['Old price'], ax=axs[1,1], legend=False)
sns.kdeplot(plot_data[plot_data['Is popular']=="Yes"]['Old price'], ax=axs[1,1])
axs[1,1].set(xlabel="Old price")
axs[1,1].legend(title='Is popular', loc='upper left', labels=['No', 'Yes'])
plt.show()
The following charts suggest that non-discounted offers or those with a minimal price reduction receive more reviews on the website compared to those with a substantial price reduction. This may be due to the agency offering discounts on less popular tours to attract a larger number of customers. Another explanation could be that the extent of the discount is positively correlated with the tour's original price. Therefore, the larger the discount, the more expensive the tour, and as noted earlier, expensive tours are less popular than their more affordable counterparts.
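The second explanation can be checked directly by correlating the discount size with the pre-discount price (in the dataset this would use the `old_price` and `discount_value` columns; the numbers below are placeholders constructed so that the discount is roughly proportional to the price):

```python
import numpy as np

rng = np.random.default_rng(123)
# placeholder prices spanning roughly the observed range (2.5k-32k PLN),
# with discounts constructed as about 20% of the base price plus noise
old_price = rng.uniform(2500, 32000, size=500)
discount_value = 0.2 * old_price + rng.normal(0, 500, size=500)

# Pearson correlation between base price and discount size
corr = np.corrcoef(old_price, discount_value)[0, 1]
print(f"correlation: {corr:.2f}")
```

A strongly positive coefficient on the real columns would support the claim that larger discounts simply accompany more expensive, and hence less popular, tours.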
raw_data["discount"] = np.where(raw_data["old_price"].isna(), 0, 1)
raw_data["discount_value"] = np.where(raw_data["old_price"].isna(), 0,
raw_data["old_price"] - raw_data["price"])
plot_data['Discount'] = raw_data['discount'].copy().map(binary_map)
plot_data['Discount value'] = raw_data['discount_value'].copy()
fig, axs = plt.subplots(1, 2, figsize=(20, 7), gridspec_kw=dict(width_ratios=[1, 2]))
ax = sns.countplot(x='Discount', hue='Is popular', dodge=1,
data=plot_data, hue_order = ['No', 'Yes'], ax=axs[0])
ax.set(ylabel="Count")
sns.histplot(plot_data[plot_data['Discount value'] > 0],
x='Discount value', hue="Is popular", ax=axs[1], legend=True, kde=True)
plt.show()
After an initial analysis of the dataset, it was decided to remove some columns as no impact on the response variable was observed. Among the deleted columns were, for example, the airport and the city to which the tour is organized (the column had too many unique values).
features = 'country, stars, rating, duration, food, price, month, multiple_destination, discount, is_popular'.split(", ")
final_data = raw_data[features]
final_data.head()
| country | stars | rating | duration | food | price | month | multiple_destination | discount | is_popular | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Grecja | 2.000 | 4.500 | 8 | Śniadania | 1500.000 | May | 0 | 0 | 1 |
| 1 | Grecja | 2.000 | 5.000 | 8 | Bez wyżywienia | 1558.000 | August | 0 | 0 | 0 |
| 2 | Grecja | 2.000 | 4.900 | 8 | Bez wyżywienia | 1635.000 | August | 0 | 0 | 0 |
| 3 | Grecja | 2.500 | 5.200 | 8 | Bez wyżywienia | 1766.000 | May | 0 | 0 | 1 |
| 4 | Chorwacja | 2.500 | 3.700 | 8 | Bez wyżywienia | 1779.000 | June | 0 | 0 | 0 |
Since the majority of columns in the analyzed dataset are categorical variables, often with many possible values, it was decided to transform these variables into binary indicators. Two columns were created for the countries Greece and Egypt, because they frequently appear in the offers and exhibit a respectively high and low percentage of popular tours. Offers with breakfast only and offers with breakfast and dinner were also distinguished, as those meal options differ markedly in their share of popular tours. Additionally, a separate column was created to indicate whether the tour takes place during the holiday season (May to August) or later.
final_data['greece'] = np.where(~final_data['country'].isin(['Grecja']), 0, 1)
final_data['egypt'] = np.where(~final_data['country'].isin(['Egipt']), 0, 1)
final_data['breakfast'] = np.where(final_data['food'].isin(['Śniadania']), 1, 0)
final_data['breakfast_dinner'] = np.where(final_data['food'].isin(['Śniadania i obiadokolacje']), 1, 0)
final_data['summer'] = np.where(final_data['month'].isin(['May', 'June', 'July', 'August']), 1, 0)
final_data.drop(columns=['country', 'food', 'month'], inplace=True)
final_data.head()
| stars | rating | duration | price | multiple_destination | discount | is_popular | greece | egypt | breakfast | breakfast_dinner | summer | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.000 | 4.500 | 8 | 1500.000 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 2.000 | 5.000 | 8 | 1558.000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 2.000 | 4.900 | 8 | 1635.000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 2.500 | 5.200 | 8 | 1766.000 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 2.500 | 3.700 | 8 | 1779.000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Since a significant portion (477) of the tours have no stars assigned by the agency, it was decided to approximate them using the average user ratings; the rating column itself was subsequently removed. It was observed that star ratings from 2 to 5 mostly correspond to average user ratings from 4.9 to 5.4, with an increase of one star corresponding to an increase of roughly 0.1 in the average customer rating. This pattern does not hold in every interval (sometimes the average rating decreases instead of increasing), but it was nevertheless used to fill in the missing observations in the "Stars" column. Tours with an average rating below 4.9 were assigned 2 stars, offers rated between 4.9 and 5.0 received 2.5 stars, and so on, up to offers rated above 5.4, which were assigned 5 stars. Since this imputation relies on somewhat arbitrarily chosen thresholds that are independent of the input data, it was performed before splitting the dataset into training and test sets.
final_data.isna().sum()
stars                   477
rating                  587
duration                  0
price                     0
multiple_destination      0
discount                  0
is_popular                0
greece                    0
egypt                     0
breakfast                 0
breakfast_dinner          0
summer                    0
dtype: int64
stars_mean_rating = final_data.groupby(['stars']).mean()['rating'].reset_index()
stars_mean_rating["count"] = final_data.groupby(['stars']).count()['rating'].reset_index()['rating']
stars_mean_rating
| stars | rating | count | |
|---|---|---|---|
| 0 | 2.000 | 4.967 | 21 |
| 1 | 2.500 | 4.689 | 9 |
| 2 | 3.000 | 5.001 | 162 |
| 3 | 3.500 | 4.954 | 37 |
| 4 | 4.000 | 5.173 | 458 |
| 5 | 4.500 | 5.230 | 81 |
| 6 | 5.000 | 5.405 | 347 |
def input_missing_stars(row):
    stars = row['stars']
    if not math.isnan(stars):
        return stars
    rating = row["rating"]
    if math.isnan(rating):
        return 3.5  # mean number of stars
    # thresholds 4.9, 5.0, ..., 5.4 map to 2.0, 2.5, ..., 4.5 stars
    for threshold, stars in zip(np.arange(49, 55) / 10, np.arange(20, 50, 5) / 10):
        if rating < threshold:
            return stars
    return 5.0
final_data['stars'] = final_data.apply(input_missing_stars, axis=1)
After imputing missing values in the "Stars" column, there are no more missing values in the dataset.
final_data.drop(columns=['rating'], inplace=True)
final_data.isna().sum()
stars                   0
duration                0
price                   0
multiple_destination    0
discount                0
is_popular              0
greece                  0
egypt                   0
breakfast               0
breakfast_dinner        0
summer                  0
dtype: int64
In the next step, the dataset was divided into a training set and a test set in a 9:1 ratio. The "stratify" argument was used to ensure the same proportion of popular and unpopular tours in both sets. Below, descriptive statistics for both sets were compared to ensure that the observations in the sets are similar.
X_cols = list(final_data.columns)
X_cols.remove('is_popular')
X, y = final_data.loc[:, X_cols], final_data.loc[:, 'is_popular']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=seed, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1888, 10), (210, 10), (1888,), (210,))
X_train.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| stars | 1888.000 | 4.166 | 0.851 | 2.000 | 4.000 | 4.000 | 5.000 | 5.000 |
| duration | 1888.000 | 8.118 | 1.932 | 1.000 | 8.000 | 8.000 | 8.000 | 21.000 |
| price | 1888.000 | 5369.082 | 3235.064 | 1500.000 | 3035.000 | 4182.000 | 7142.750 | 31598.000 |
| multiple_destination | 1888.000 | 0.056 | 0.230 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| discount | 1888.000 | 0.379 | 0.485 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| greece | 1888.000 | 0.135 | 0.341 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| egypt | 1888.000 | 0.060 | 0.238 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| breakfast | 1888.000 | 0.249 | 0.433 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| breakfast_dinner | 1888.000 | 0.197 | 0.397 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| summer | 1888.000 | 0.367 | 0.482 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
X_test.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| stars | 210.000 | 4.140 | 0.871 | 2.000 | 4.000 | 4.000 | 5.000 | 5.000 |
| duration | 210.000 | 8.310 | 2.401 | 3.000 | 8.000 | 8.000 | 9.000 | 22.000 |
| price | 210.000 | 5709.781 | 4177.980 | 1779.000 | 2947.250 | 4049.000 | 7434.000 | 31845.000 |
| multiple_destination | 210.000 | 0.067 | 0.250 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| discount | 210.000 | 0.343 | 0.476 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| greece | 210.000 | 0.114 | 0.319 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| egypt | 210.000 | 0.057 | 0.233 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| breakfast | 210.000 | 0.229 | 0.421 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| breakfast_dinner | 210.000 | 0.200 | 0.401 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| summer | 210.000 | 0.386 | 0.488 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
y_test.mean(), y_train.mean()
(0.29523809523809524, 0.2960805084745763)
In this part, both the training and test sets were standardized using the mean and standard deviation calculated from the training set. Standardizing the data can be beneficial, as it improves the numerical stability of the model and often shortens training time. Some of the methods used in this project require standardized input data.
def scale_train_test(X_train, X_test):
    """
    Calculates the mean (m) and standard deviation (sd)
    of the training set and standardizes both the train and test sets
    according to the formula z = (x - m) / sd.
    """
    m, sd = X_train.mean(), X_train.std()
    X_train_scaled = (X_train - m) / sd
    X_test_scaled = (X_test - m) / sd
    return X_train_scaled, X_test_scaled
X_train_scaled, X_test_scaled = scale_train_test(X_train, X_test)
This section covers the construction of three models: Random Forest, Gradient Boosting Trees, and Support Vector Machines. All models were evaluated in terms of their classification abilities and compared. An attempt was also made to interpret them using relevant charts.
The first model constructed to identify popular Rainbow Tour trips is the Random Forest. The model utilizes multiple decision trees and makes decisions based on majority voting, i.e., the classification indicated by the majority of trees. During the construction of each decision tree node, only a small subset of all features is considered. This prevents the model from using the same, most influential predictors when building each tree. As a result, many uncorrelated decision trees are created, and their predictions are averaged to obtain the final classification decision.
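The voting described above can be made concrete with a small, self-contained sketch on synthetic data (`X_demo`, `forest`, and all parameter values here are illustrative, not the trip dataset). One caveat: scikit-learn's forest actually soft-votes, i.e. averages the class probabilities of the trees, so the hard majority vote below is only an approximation of its decision rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy data; parameters chosen only for illustration
X_demo, y_demo = make_classification(n_samples=300, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                random_state=0).fit(X_demo, y_demo)

# each tree in the ensemble casts its own vote...
tree_votes = np.array([tree.predict(X_demo) for tree in forest.estimators_])
majority_vote = (tree_votes.mean(axis=0) >= 0.5).astype(int)

# ...and the hard majority vote agrees with the forest's (probability-averaged)
# decision on almost every sample
agreement = (majority_vote == forest.predict(X_demo)).mean()
```

The `max_features="sqrt"` argument is what limits the number of candidate features considered at each split, producing the decorrelated trees described above.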
Before training the model, hyperparameter tuning was conducted to select a combination of model parameters that resulted in the lowest classification error rate during cross-validation. This was done using random search, involving the exploration of randomly chosen values from a defined parameter space. The method identified the following optimal hyperparameters:
# number of trees
n_estimators = [int(x) for x in np.linspace(50, 300, 6, endpoint=True)]
# method of computing the max number of features considered at each tree split
max_features = ['log2', 'sqrt']
# tree depth
max_depth = [int(x) for x in np.linspace(1, 4, 4, endpoint=True)]
# minimum number of samples required in a leaf
min_samples_leaf = [int(x) for x in np.linspace(1, 11, 5, endpoint=True)]
# minimum number of observations required to split a node
min_samples_split = [int(x) for x in np.linspace(5, 55, 5, endpoint=True)]
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_leaf': min_samples_leaf,
'min_samples_split': min_samples_split}
# forest = RandomForestClassifier(random_state=seed)
# rf_cv = RandomizedSearchCV(estimator = forest,
# param_distributions = random_grid,
# n_iter = 150, cv=5, verbose=2,
# random_state=seed, n_jobs = -1)
# rf_cv.fit(X_train, y_train)
# rf_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
{'n_estimators': 300,
'min_samples_split': 5,
'min_samples_leaf': 3,
'max_features': 'log2',
'max_depth': 4}
# with open('rf_cv.pickle', 'wb') as handle:
# pickle.dump(rf_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('rf_cv.pickle', 'rb') as handle:
    rf_cv = pickle.load(handle)
rf = RandomForestClassifier(n_estimators=rf_cv.best_params_['n_estimators'],
min_samples_split=rf_cv.best_params_['min_samples_split'],
min_samples_leaf=rf_cv.best_params_['min_samples_leaf'],
max_features=rf_cv.best_params_['max_features'],
max_depth=rf_cv.best_params_['max_depth'], random_state=seed)
rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=4, max_features='log2', min_samples_leaf=3,
                       min_samples_split=5, n_estimators=300, random_state=123)
y_train_pred = rf.predict(X_train)
y_pred = rf.predict(X_test)
class color:
    """Enables formatting printed strings."""
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'
def fevaluate_model(y_test, y_pred):
    """
    Returns a dictionary of the main metrics used to evaluate a classification model,
    comparing predictions with actual values.
    Takes two arguments: real y values and predicted y values, respectively.
    """
    # accuracy, error rate, AUC
    acc = accuracy_score(y_test, y_pred)
    err = 1 - acc
    auc = roc_auc_score(y_test, y_pred)
    # calculate TN, FP, FN, TP (sklearn's ravel() returns them in this order)
    TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
    # calculate sensitivity, specificity, positive and negative predictive value
    recall = TP / (TP+FN)        # sensitivity
    specificity = TN / (TN+FP)
    PPV = TP / (TP+FP)           # precision
    NPV = TN / (TN+FN)
    # F1 score
    F1 = 2*(recall*PPV)/(recall+PPV)
    perf_metrics = {
        "Accuracy": acc,
        "Error rate": err,
        "AUC": auc,
        "Recall": recall,
        "Specificity": specificity,
        "PPV": PPV,
        "NPV": NPV,
        "F1": F1,
    }
    return perf_metrics
def evaluate_model(clf, X_test, y_test, X_train=None, y_train=None, verbose=True, threshold=0.5):
    """
    Returns a dictionary of metrics assessing model performance on the test set and,
    optionally, on the train set, for a given classification threshold.
    Takes three mandatory arguments: the trained model, the X test set and the y test vector.
    Optional arguments: the X train set, the y train vector, a boolean indicating
    whether results should be printed (default True), and the threshold value (default 0.5).
    """
    def eval_print(metrics_dict, set_name="test"):
        """Prints values of model performance metrics stored in a dictionary."""
        print(color.BOLD + f"Evaluation measures on {set_name} set:" + color.END)
        for key, value in metrics_dict.items():
            print(f"{key}: {round(value, 3)}")
        print("-"*40)
    perf_metrics = {}
    if X_train is not None and y_train is not None:
        # check model performance on the train set
        y_pred_train = (clf.predict_proba(X_train)[:,1] >= threshold).astype(bool)
        train_metrics = fevaluate_model(y_train, y_pred_train)
        if verbose:
            eval_print(train_metrics, set_name="train")
        perf_metrics["Train"] = train_metrics
    # check model performance on the test set
    y_pred_test = (clf.predict_proba(X_test)[:,1] >= threshold).astype(bool)
    test_metrics = fevaluate_model(y_test, y_pred_test)
    if verbose:
        eval_print(test_metrics)
    perf_metrics["Test"] = test_metrics
    return perf_metrics
rf_perf_metrics = evaluate_model(rf, X_test, y_test, X_train, y_train)
Evaluation measures on train set:
Accuracy: 0.728
Error rate: 0.272
AUC: 0.546
Recall: 0.1
Specificity: 0.992
PPV: 0.848
NPV: 0.724
F1: 0.179
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.7
Error rate: 0.3
AUC: 0.506
Recall: 0.032
Specificity: 0.98
PPV: 0.4
NPV: 0.707
F1: 0.06
----------------------------------------
def plot_conf_matrix(y_test, y_pred):
    """
    Plots a confusion matrix based on the predicted
    and real values of the target variable y.
    """
    color = 'black'
    cf_matrix = confusion_matrix(y_test, y_pred)
    group_names = ['TN','FP','FN','TP']
    group_counts = [round(value, 3) for value in cf_matrix.flatten()/cf_matrix.sum()]
    labels = [f'{v1}\n\n{v2}' for v1, v2 in zip(group_names, group_counts)]
    labels = np.asarray(labels).reshape(2, 2)
    ax = sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
    ax.set_ylabel('True labels', color=color)
    ax.set_xlabel('Predicted labels', color=color)
    ax.set_title('Confusion Matrix', color=color)
    plt.grid(False)
    plt.show()
plot_conf_matrix(y_test, y_pred)
The Random Forest classifier was trained with the parameters selected during hyperparameter tuning. The accuracy of the model's predictions on the test set was 0.7, indicating that 70% of the trips were correctly classified. However, the printout above and the confusion matrix show that the model performs very poorly at identifying popular tours: it assigns almost all observations to the class of unpopular tours. Since the dataset is imbalanced, predicting the negative class alone is enough to achieve relatively high classification accuracy. The very low sensitivity and F1-score confirm that the model struggles to identify the positive class at all; the sensitivity value means that only 3.2% of popular tours were correctly identified.
To improve the Random Forest model's ability to identify objects from the positive class, the SMOTE method was applied. The SMOTE algorithm is an oversampling technique that, instead of simply replicating observations from the minority class, generates new points as a combination of current observations and their nearest neighbors. After transforming the training set by generating new representatives of the positive class, hyperparameter tuning was performed again, and a new Random Forest model was built.
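The core interpolation step of SMOTE can be sketched in a few lines of NumPy (a simplified illustration of the idea, not the `imblearn` implementation; the example feature values are made up):

```python
import numpy as np

def smote_point(x, x_nn, rng):
    """Synthesize one sample on the segment between a minority observation x
    and one of its minority-class nearest neighbors x_nn."""
    lam = rng.uniform(0.0, 1.0)  # random interpolation coefficient in [0, 1)
    return x + lam * (x_nn - x)

rng = np.random.default_rng(123)
x = np.array([4.0, 3500.0])      # e.g. (stars, price) of a popular trip
x_nn = np.array([4.5, 4100.0])   # a neighboring popular trip
synthetic = smote_point(x, x_nn, rng)
```

Because the synthetic point always lies on the segment between two real minority observations, SMOTE enriches the minority class without duplicating rows exactly.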
smote = SMOTE(random_state=seed)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
X_train_scaled_smote, _ = scale_train_test(X_train_smote, X_train_smote)
y_train_smote.value_counts()
is_popular
1    1329
0    1329
Name: count, dtype: int64
# rf_smote = RandomForestClassifier(random_state=seed)
# rf_smote_cv = RandomizedSearchCV(estimator = rf_smote,
# param_distributions = random_grid,
# n_iter = 150, cv=5, verbose=2,
# random_state=seed, n_jobs = -1)
# rf_smote_cv.fit(X_train_smote, y_train_smote)
# rf_smote_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
{'n_estimators': 250,
'min_samples_split': 30,
'min_samples_leaf': 11,
'max_features': 'sqrt',
'max_depth': 4}
# with open('rf_smote_cv.pickle', 'wb') as handle:
# pickle.dump(rf_smote_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('rf_smote_cv.pickle', 'rb') as handle:
    rf_smote_cv = pickle.load(handle)
rf_smote = RandomForestClassifier(
n_estimators=rf_smote_cv.best_params_['n_estimators'],
min_samples_split=rf_smote_cv.best_params_['min_samples_split'],
min_samples_leaf=rf_smote_cv.best_params_['min_samples_leaf'],
max_features=rf_smote_cv.best_params_['max_features'],
max_depth=rf_smote_cv.best_params_['max_depth'], random_state=seed
)
rf_smote.fit(X_train_smote, y_train_smote)
RandomForestClassifier(max_depth=4, min_samples_leaf=11, min_samples_split=30,
                       n_estimators=250, random_state=123)
y_train_pred = rf_smote.predict(X_train_smote)
y_pred = rf_smote.predict(X_test)
rf_smote_perf_metrics = evaluate_model(rf_smote, X_test, y_test, X_train_smote, y_train_smote)
Evaluation measures on train set:
Accuracy: 0.732
Error rate: 0.268
AUC: 0.732
Recall: 0.796
Specificity: 0.667
PPV: 0.705
NPV: 0.766
F1: 0.748
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.595
Error rate: 0.405
AUC: 0.6
Recall: 0.613
Specificity: 0.588
PPV: 0.384
NPV: 0.784
F1: 0.472
----------------------------------------
plot_conf_matrix(y_test, y_pred)
The application of the SMOTE algorithm improved the model's ability to detect positive samples. The sensitivity of the model on the test set increased from 3.2% to 61.3%, meaning that 61.3% of popular trips are now correctly identified. On the other hand, the overall classification accuracy decreased by about 10 percentage points compared to the first model, which behaves almost like a naive classifier assigning everything to the negative class. The new model also has a much lower specificity: the improved ability to recognize the positive class came at the cost of the percentage of correctly identified unpopular trips.
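Resampling is not the only lever for this trade-off: the `threshold` argument of the `evaluate_model` function defined earlier points at another one, namely lowering the decision threshold below 0.5. A self-contained sketch on synthetic, imbalanced data (all names and parameters here are illustrative) shows the direction of the effect:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# imbalanced synthetic data (roughly 30% positives, as in the trip dataset)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.7, 0.3],
                                     random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]

def recall_at(threshold):
    """Share of actual positives recovered at a given decision threshold."""
    pred = proba >= threshold
    return pred[y_demo == 1].mean()

recall_low, recall_default = recall_at(0.3), recall_at(0.5)
```

Lowering the threshold can only flip predictions from negative to positive, so recall never decreases as the threshold drops; specificity, conversely, can only fall.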
The charts below depict the importance of individual variables in the decision-making process of the Random Forest models trained on the original dataset and on the dataset after applying the SMOTE algorithm, respectively. In both cases, the most significant variables are "Stars", "Price", and the type of board ("breakfast"). However, in the case of the first model, the length of the trip, the number of destinations, and the departure date also play a significant role in the classification; in the second model, these variables are much less important.
feature_names = X_test.columns
fig, ax = plt.subplots(ncols=2, figsize=(20, 5))
# impurity-based feature importances of both forests
rf_importances = pd.Series(rf.feature_importances_, index=feature_names)
rf_importances.plot.bar(ax=ax[0], alpha=0.7, title="Random Forest")
rf_smote_importances = pd.Series(rf_smote.feature_importances_, index=feature_names)
rf_smote_importances.plot.bar(ax=ax[1], alpha=0.7, title="Random Forest + SMOTE")
plt.show()
While the primary goal of building machine learning models is to predict outcomes from input information as accurately as possible, the ability to understand how the classification was made often proves equally important. To gain a better understanding of how the Random Forest models presented in this section make predictions, Ceteris Paribus (CP) profiles and Partial Dependence Plots (PDP) were used.
The Ceteris Paribus (CP) profile illustrates how the model's prediction changes for a specific instance when the value of a single variable is adjusted while all other attributes are kept constant. In a classification problem, it is the predicted probability of the object belonging to a given class that changes.
To construct the CP profiles, an example trip was created, and the plots show how the probability of it being classified as "popular" changes depending on the selected variable and the model used.
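The idea behind a Ceteris Paribus profile can be reproduced by hand: copy a single observation along a grid of values of one variable, keep all other columns fixed, and record the predicted probability. The sketch below does this on a synthetic stand-in for the trip data (the two-column `X_demo` frame and all parameters are illustrative assumptions, not the actual dataset or the `dalex` implementation used later):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the trip data (illustrative only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"price": rng.uniform(1500.0, 10000.0, 300),
                       "stars": rng.integers(2, 6, 300).astype(float)})
y_demo = (X_demo["stars"] + rng.normal(0.0, 1.0, 300) > 4).astype(int)
clf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

def ceteris_paribus(model, instance, variable, grid):
    """Predicted probabilities for one instance with `variable`
    swept over `grid` and all other columns held fixed."""
    profile = pd.concat([instance] * len(grid), ignore_index=True)
    profile[variable] = grid
    return model.predict_proba(profile)[:, 1]

instance = X_demo.iloc[[0]]
cp_probs = ceteris_paribus(clf, instance, "stars", np.arange(2.0, 5.5, 0.5))
```

Plotting `cp_probs` against the grid gives exactly the kind of curve `dalex`'s `predict_profile` draws.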
trip_1 = pd.DataFrame({
"stars": [4],
"duration": [5],
"price": [4000],
"multiple_destination": [0],
"discount": [0],
"greece": [1],
"egypt": [0],
"breakfast": [0],
"breakfast_dinner": [0],
"summer": [1],
})
rf_exp = dx.Explainer(rf, X_train_smote, y_train_smote, label="RF")
rf_smote_exp = dx.Explainer(rf_smote, X_train_smote, y_train_smote, label="RF with SMOTE")
cp_trip_1 = rf_exp.predict_profile(trip_1)
cp_smote_trip_1 = rf_smote_exp.predict_profile(trip_1)
Preparation of a new explainer is initiated
  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : RF
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0817, mean = 0.312, max = 0.633
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.531, mean = 0.188, max = 0.903
  -> model_info        : package sklearn
A new explainer has been created!
Preparation of a new explainer is initiated
  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : RF with SMOTE
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.092, mean = 0.499, max = 0.837
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.78, mean = 0.000841, max = 0.883
  -> model_info        : package sklearn
A new explainer has been created!
Calculating ceteris paribus: 100%|█████████| 10/10 [00:00<00:00, 33.43it/s] Calculating ceteris paribus: 100%|█████████| 10/10 [00:00<00:00, 39.08it/s]
The first thing to note is that the sample trip was classified into different classes depending on the model used. The first model classified it as unpopular (not surprising, since this model classifies almost all trips this way), while the model trained on the SMOTE-resampled dataset classified it as popular. The plots show how the models' predictions would change if the price or the number of stars assigned to the trip were altered. It turns out that the first model would classify the sample trip as popular if it had 5 stars. An increase in price by just under PLN 1,000 would decrease the probability of assignment to the positive class by about 10 percentage points for both models, although in neither case would it change the model's final decision.
cp_trip_1.plot(cp_smote_trip_1, variables = ['stars', 'price'])
The purpose of the Partial Dependence Plot (PDP) is to illustrate the marginal effect of changes in the value of a selected explanatory variable on the average prediction of the model. The method is analogous to the previously described Ceteris Paribus Plot but shows the model's behavior across the entire dataset, essentially representing the average of PDP plots calculated for all observations.
The presented PDP plots are very similar to the Ceteris Paribus plots shown above. This means that the effect of changes in the price and the number of stars on the model predictions for a sample instance is very close to the average effect determined for all observations. This is likely because the sample trip is representative, and its values do not deviate significantly from the average on the test set. Additionally, the model's decisions are likely influenced primarily by the variables presented in the plots.
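The relationship between the two methods can be made explicit in code: the partial dependence value at a grid point is simply the ceteris-paribus prediction averaged over the whole dataset. The sketch below demonstrates this on synthetic data (all names and values are illustrative assumptions, not the trip dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"price": rng.uniform(1500.0, 10000.0, 200),
                       "stars": rng.integers(2, 6, 200).astype(float)})
y_demo = (X_demo["stars"] > 3.5).astype(int)
clf = RandomForestClassifier(random_state=1).fit(X_demo, y_demo)

def pdp_manual(model, X, variable, grid):
    """PDP value at each grid point = prediction averaged over all
    observations with `variable` forced to that value."""
    means = []
    for v in grid:
        X_mod = X.copy()
        X_mod[variable] = v
        means.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(means)

pdp_stars = pdp_manual(clf, X_demo, "stars", np.array([2.0, 3.0, 4.0, 5.0]))
```

Since the synthetic target depends positively on `stars`, the resulting profile rises with the number of stars, mirroring the direction seen in the plots above.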
pd_rf = rf_exp.model_profile(variables = ['stars', 'price'])
pd_rf_smote = rf_smote_exp.model_profile(variables = ['stars', 'price'])
pd_rf.plot(pd_rf_smote)
Calculating ceteris paribus: 100%|███████████| 2/2 [00:00<00:00, 2.71it/s] Calculating ceteris paribus: 100%|███████████| 2/2 [00:00<00:00, 3.26it/s]
In the next step of the project, a Gradient Boosting Trees model was built. In this method, trees are added sequentially: each new tree is fitted to correct the errors (residuals) of the ensemble built so far, and a random subsample of observations may be drawn for each tree. This allows the model to gradually learn from its mistakes.
As with the Random Forest, hyperparameter tuning was conducted first to find an optimal combination of parameters. The same values as for the Random Forest were considered, with two additional parameters, "subsample" and "learning_rate", added.
The model was trained on the dataset after applying SMOTE because a significant improvement in the model's ability to identify positive class observations was observed in the previous part after using SMOTE.
The tuning procedure identified the following optimal hyperparameter set:
# number of trees
n_estimators = [int(x) for x in np.linspace(50, 300, 6, endpoint=True)]
# method of computing the max number of features considered at each tree split
max_features = ['log2', 'sqrt']
# tree depth
max_depth = [int(x) for x in np.linspace(1, 4, 4, endpoint=True)]
# minimum number of samples required in a leaf
min_samples_leaf = [int(x) for x in np.linspace(1, 11, 5, endpoint=True)]
# minimum number of observations required to split a node
min_samples_split = [int(x) for x in np.linspace(5, 55, 5, endpoint=True)]
# gradient boosting hyperparameters
subsample = [0.6, 0.7, 0.8, 0.9]
learning_rate = [0.005, 0.01, 0.05, 0.1, 0.2, 0.3]
random_grid = {
'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_leaf': min_samples_leaf,
'min_samples_split': min_samples_split,
'learning_rate': learning_rate,
'subsample': subsample
}
# gb_smote = GradientBoostingClassifier(random_state=seed)
# gb_smote_cv = RandomizedSearchCV(
# estimator = gb_smote,
# param_distributions = random_grid,
# n_iter = 150, cv=5, verbose=2,
# random_state=seed, n_jobs = -1
# )
# gb_smote_cv.fit(X_train_smote, y_train_smote)
# gb_smote_cv.best_params_
Fitting 5 folds for each of 150 candidates, totalling 750 fits
{'subsample': 0.9,
'n_estimators': 200,
'min_samples_split': 30,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 4,
'learning_rate': 0.3}
# with open('gb_smote_cv.pickle', 'wb') as handle:
# pickle.dump(gb_smote_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('gb_smote_cv.pickle', 'rb') as handle:
    gb_smote_cv = pickle.load(handle)
gb = GradientBoostingClassifier(
n_estimators=gb_smote_cv.best_params_['n_estimators'],
min_samples_split=gb_smote_cv.best_params_['min_samples_split'],
min_samples_leaf=gb_smote_cv.best_params_['min_samples_leaf'],
max_features=gb_smote_cv.best_params_['max_features'],
max_depth=gb_smote_cv.best_params_['max_depth'],
learning_rate=gb_smote_cv.best_params_['learning_rate'],
subsample=gb_smote_cv.best_params_['subsample'],
random_state=seed
)
gb.fit(X_train_smote, y_train_smote)
GradientBoostingClassifier(learning_rate=0.3, max_depth=4, max_features='sqrt',
                           min_samples_split=30, n_estimators=200,
                           random_state=123, subsample=0.9)
y_train_pred = gb.predict(X_train_smote)
y_pred = gb.predict(X_test)
gb_perf_metrics = evaluate_model(gb, X_test, y_test, X_train_smote, y_train_smote)
Evaluation measures on train set:
Accuracy: 0.905
Error rate: 0.095
AUC: 0.905
Recall: 0.878
Specificity: 0.932
PPV: 0.928
NPV: 0.884
F1: 0.903
----------------------------------------
Evaluation measures on test set:
Accuracy: 0.69
Error rate: 0.31
AUC: 0.607
Recall: 0.403
Specificity: 0.811
PPV: 0.472
NPV: 0.764
F1: 0.435
----------------------------------------
plot_conf_matrix(y_test, y_pred)
The above classifier achieved an accuracy of 0.69 on the test set, indicating that 69% of trips were correctly classified, a significant improvement over the Random Forest trained with SMOTE. The printout and confusion matrix show, however, that the model is still much better at predicting non-popular trips than popular ones, as evidenced by the low sensitivity and F1-score. A sensitivity of 40.3% means that only this percentage of popular trips was correctly identified, while the specificity of 81.1% means that this share of non-popular trips was correctly classified.
For the analysis of the decision-making process of the model created in this section, break-down plot and Shapley value plot for additive attributions were used.
The Break-Down plot helps understand the additive impact of individual variable values on the model's final decision by decomposing the predicted probability into components attributed to the individual attributes. The plot starts from the average model prediction for the entire dataset; each subsequent level shows the average change in the prediction caused by fixing the value of the next explanatory variable to that of the considered observation. The last row contains the sum of the average model prediction and the contributions of all variables, i.e., the predicted probability for the considered observation.
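The decomposition described above can be computed by hand: fix the variables one by one to the instance's values across a copy of the dataset and record how the mean prediction moves. The sketch below uses synthetic stand-in data (all names and values are illustrative assumptions, not the `dalex` implementation):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X_demo = pd.DataFrame({"price": rng.uniform(1500.0, 10000.0, 300),
                       "stars": rng.integers(2, 6, 300).astype(float)})
y_demo = (X_demo["stars"] > 3.5).astype(int)
clf = RandomForestClassifier(random_state=7).fit(X_demo, y_demo)

def break_down(model, X, instance, order):
    """Contribution of each variable = change in the mean prediction after
    additionally fixing that variable to the instance's value."""
    X_mod = X.copy()
    prev = model.predict_proba(X_mod)[:, 1].mean()  # average model prediction
    contributions = {}
    for var in order:
        X_mod[var] = instance[var].iloc[0]
        cur = model.predict_proba(X_mod)[:, 1].mean()
        contributions[var] = cur - prev
        prev = cur
    return contributions

instance = X_demo.iloc[[0]]
contrib = break_down(clf, X_demo, instance, ["stars", "price"])
base_pred = clf.predict_proba(X_demo)[:, 1].mean()
final_pred = clf.predict_proba(instance)[0, 1]
```

By construction the contributions telescope: the base prediction plus all contributions reconstructs the prediction for the instance, which is exactly the last row of a Break-Down plot.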
To construct both plots, the artificial trip offer constructed in the previous section was used.
gb_exp = dx.Explainer(gb, X_train_smote, y_train_smote, label="Gradient Boosting Trees")
bd_trip = gb_exp.predict_parts(trip_1, type='break_down')
Preparation of a new explainer is initiated
  -> data              : 2658 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2658 values
  -> model_class       : sklearn.ensemble._gb.GradientBoostingClassifier (default)
  -> label             : Gradient Boosting Trees
  -> predict function  : <function yhat_proba_default at 0x00000195212293A0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 8.41e-05, mean = 0.501, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.787, mean = -0.00134, max = 0.989
  -> model_info        : package sklearn
A new explainer has been created!
The chart below shows the contribution of all variables to the final prediction of the model. We can observe that the model considered the sample trip as popular with a probability of 56.5%. The most significant positive impact (increasing the likelihood of classifying the trip as popular) can be attributed to the trip's price of 4000 PLN, a star rating of 4, and a duration of 5 days. The probability is significantly reduced by the fact that the trip is to Greece and takes place in the summer season.
bd_trip.plot()
Shapley Additive Explanations address an issue present in break-down plots: the contributions attributed to individual variables depend on the order in which the variables are considered. The Shapley approach removes this dependence by averaging each variable's contribution over all (or many) possible orderings.
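This averaging can be implemented directly on top of the break-down idea (a self-contained sketch on synthetic data with only three variables, so all orderings can be enumerated exactly; all names and values are illustrative assumptions, not the `dalex` implementation):

```python
import numpy as np
import pandas as pd
from itertools import permutations
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)
X_demo = pd.DataFrame({"price": rng.uniform(1500.0, 10000.0, 200),
                       "stars": rng.integers(2, 6, 200).astype(float),
                       "summer": rng.integers(0, 2, 200).astype(float)})
y_demo = (X_demo["stars"] > 3.5).astype(int)
clf = RandomForestClassifier(random_state=11).fit(X_demo, y_demo)

def bd_contributions(model, X, instance, order):
    """Break-down contributions for one ordering of the variables."""
    X_mod, contrib = X.copy(), {}
    prev = model.predict_proba(X_mod)[:, 1].mean()
    for var in order:
        X_mod[var] = instance[var].iloc[0]
        cur = model.predict_proba(X_mod)[:, 1].mean()
        contrib[var], prev = cur - prev, cur
    return contrib

def shapley_values(model, X, instance):
    """Shapley value of a variable = its break-down contribution
    averaged over every possible ordering of the variables."""
    orders = list(permutations(X.columns))
    totals = {c: 0.0 for c in X.columns}
    for order in orders:
        for var, val in bd_contributions(model, X, instance, order).items():
            totals[var] += val / len(orders)
    return totals

instance = X_demo.iloc[[0]]
shap_vals = shapley_values(clf, X_demo, instance)
base_pred = clf.predict_proba(X_demo)[:, 1].mean()
final_pred = clf.predict_proba(instance)[0, 1]
```

The Shapley values still sum to the difference between the instance prediction and the average prediction, but no longer depend on any single ordering. With many variables the full enumeration becomes infeasible, which is why practical implementations sample a subset of orderings.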
In the chart below, we can see that the contributions of most variables are similar to those in the break-down plot. An important difference is the number of stars, whose impact turned out to be much smaller (and, moreover, negative) than the previous chart indicated. Instead, the type of meals and the fact that the price is not discounted turned out to be significant for the classification decision.
shap_trip_1 = gb_exp.predict_parts(trip_1, type='shap')
shap_trip_1.plot()
def update_master_results(model_names, model_version, models_pm, master_results):
    """Appends the test-set metrics of the given models to the results data frame."""
    for i, model_name in enumerate(model_names):
        test_pm = models_pm[i].get("Test")
        acc = test_pm.get("Accuracy")
        recall = test_pm.get("Recall")
        specificity = test_pm.get("Specificity")
        ppv = test_pm.get("PPV")
        f1 = test_pm.get("F1")
        new_row = [model_name, model_version, acc, recall, specificity, ppv, f1]
        master_results.loc[len(master_results)] = new_row
    return master_results
# data frame for comparing the classification results
master_results = pd.DataFrame(columns=["Model", "Class imbalance approach", "Accuracy", "Recall", "Specificity", "Precision", "F1"])
model_names = ["Random Forest"]
model_version = "-"
models_pm = [rf_perf_metrics]
master_results = update_master_results(model_names, model_version, models_pm, master_results)
model_names = ["Random Forest", "Gradient Boosting Trees"]
model_version = "SMOTE"
models_pm = [rf_smote_perf_metrics, gb_perf_metrics]
master_results = update_master_results(model_names, model_version, models_pm, master_results)
The table below presents the classification results of all three models. Among them, Gradient Boosting Trees combined with the SMOTE algorithm achieves the best overall results: it correctly classifies 69% of observations but identifies only 40.3% of the trips that are popular among the agency's clients. Random Forest with SMOTE is better at identifying the positive class: its accuracy is only 59.5%, but it correctly identifies 61.3% of popular trips; compared to Gradient Boosting Trees, however, it is worse at detecting unpopular trips. Random Forest without SMOTE performs the worst.
master_results
| | Model | Class imbalance approach | Accuracy | Recall | Specificity | Precision | F1 |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest | - | 0.700 | 0.032 | 0.980 | 0.400 | 0.060 |
| 1 | Random Forest | SMOTE | 0.595 | 0.613 | 0.588 | 0.384 | 0.472 |
| 2 | Gradient Boosting Trees | SMOTE | 0.690 | 0.403 | 0.811 | 0.472 | 0.435 |
def spider_plot(metrics_lists, labels, plot_title):
    """Draws a radar chart comparing the models on the selected metrics."""
    categories = ['Accuracy', 'Recall', 'Specificity', 'Precision', 'F1']
    N = len(categories)
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]
    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111, polar=True)
    ax.set_theta_offset(np.pi / 2)
    ax.set_theta_direction(-1)
    plt.xticks(angles[:-1], categories, size=15)
    ax.set_rlabel_position(0.3)
    plt.yticks([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], color="grey", size=12)
    plt.ylim(0, 1)
    colors = ["blue", "green", "red", "black"]
    for i in range(len(metrics_lists)):
        values = list(metrics_lists[i])
        values += values[:1]  # close the polygon
        ax.plot(angles, values, linewidth=1.1, linestyle='solid', label=labels[i], color=colors[i])
    plt.title(plot_title, size=20, y=1.05)
    plt.legend(prop={'size': 15}, loc='lower left')
    plt.show()
metrics_lists = [list(r)[2:] for r in master_results.values.tolist()]
labels = ["Random Forest", "Random Forest + SMOTE", "Gradient Boosting Trees + SMOTE"]
plot_title = "Model comparison"
spider_plot(metrics_lists, labels, plot_title)
The aim of this project was to build machine learning classifiers capable of predicting, with satisfactory accuracy, the popularity of Rainbow Tour's summer 2023 offerings. As a result, three models were built.
The model that achieved the best classification quality was the Gradient Boosting classifier combined with the SMOTE algorithm, with an accuracy of 69%, sensitivity of 40%, and specificity of 81%. It was observed that without oversampling the models classified the vast majority of trips as unpopular; thanks to the application of SMOTE, they were able to better identify observations belonging to the positive class. The Gradient Boosting model achieved a worse result than the Random Forest only for sensitivity: the Random Forest with SMOTE is about 20 percentage points better at correctly identifying trips that are popular among customers.
In addition to achieving high classification quality, the project aimed at interpreting the built models to understand which parameters of trips influence their popularity among agency customers and how. The project led to the following conclusions:
Moreover, the following action could be taken in order to improve the model's performance: